CLUSTER CONFIGURATION REPOSITORY
CROSS-REFERENCE TO RELATED APPLICATIONS 
This application claims the benefit of U.S. Provisional Patent Application number
60/201,209, filed May 2, 2000, and entitled "Cluster Configuration Repository," and U.S.
Provisional Application number 60/201,099, filed May 2, 2000, and entitled "Carrier Grade High
Availability Platform," which are hereby incorporated by reference.

BACKGROUND OF THE INVENTION 

Field of the Invention 

This invention relates to data management for a carrier-grade high availability platform,
and more particularly, to a repository system and method for the maintenance of, and access to,
cluster configuration data in real-time.
Discussion of the Related Art 

High availability computer systems provide basic and real-time computing services. In
order to provide highly available services, peers in the system must have access to, or be capable
of having access to, configuration data in real-time.

Computer networks allow data and services to be distributed among computer systems. 
A clustered network provides a network with system services, applications and hardware divided 
into nodes that can join or leave a cluster as is necessary. A clustered high availability computer
system must maintain cluster data in order to provide services in real-time. Generally this
creates large overhead and commitment of system resources and the need for additional 
hardware to provide the high speed access necessary. The additional hardware and system 
complexity can ultimately slow system performance. System costs are also increased by the 
hardware and complex software additions. 

SUMMARY OF THE INVENTION

The present invention is directed to a system for providing real-time cluster configuration 
data within a clustered computer network that substantially obviates one or more of the problems 
due to limitations and disadvantages of the related art. An object of the present invention is to 
provide an innovative system and method for providing real-time storage and retrieval of cluster
configuration data and real-time recovery capabilities in the event a master node of a cluster, or
its configuration data, is inaccessible due to failure or corruption.

It is therefore an object of the present invention to provide real-time access and retrieval
of cluster configuration data.

It is also an object of the present invention to provide primary and secondary repositories
and repository managers to eliminate down time from a single point of failure.

A further object of the present invention is the ability for external management and
configuration operations to be initiated merely by updating the information kept in the
repository. For example, an application can register its interest in specific information kept in
the repository and will then be automatically notified whenever any changes in that data occur.

Another object of the present invention is to allow the repository to be used by the high
availability aware applications as a highly available, distributed, persistent storage facility for
slow-changing application/device state information (such as calibration data, software version
information, health history, and administrative states).

Additional features and advantages of the invention will be set forth in the description
which follows, and in part will be apparent from the description, or may be learned by practice of
the invention. The objectives and other advantages of the invention will be realized and attained
by the structure particularly pointed out in the written description and claims hereof as well as 
the appended drawings. 

To achieve these and other advantages and in accordance with the purpose of the present
invention, as embodied and broadly described, the system for providing real-time cluster
configuration data within a clustered computer network includes a plurality of clusters, including
a primary node in each cluster wherein said primary node includes a primary repository manager,
a secondary node in each cluster wherein said secondary node includes a secondary repository
manager, and wherein said secondary repository manager cooperates with said primary
repository manager to maintain information at said secondary node consistent with information
maintained at said primary node.

In another aspect, a method of providing real-time cluster configuration data within a
clustered computer network including a plurality of clusters includes the steps of choosing a
primary node in each cluster wherein the primary node includes a primary repository manager,
choosing a secondary node in each cluster wherein the secondary node includes a secondary
repository manager, and causing the secondary repository manager to cooperate with the primary
repository manager to maintain information at the secondary node consistent with information
maintained at the primary node.

In another aspect, a computer program product includes a computer useable medium
having computer readable code embodied therein for providing real-time cluster configuration
data within a clustered computer network including a plurality of clusters, the computer program
product adapted when run on a computer to effect steps including choosing a primary node in
each cluster wherein the primary node includes a primary repository manager, choosing a
secondary node in each cluster wherein the secondary node includes a secondary repository
manager, and causing the secondary repository manager to cooperate with the primary repository
manager to maintain information at the secondary node consistent with information maintained
at the primary node.

In a further aspect, a computer program product includes a computer useable medium
having computer readable code embodied therein for providing real-time cluster configuration
data within a clustered computer network comprising a plurality of clusters, the computer
program product including means for choosing a primary node in each cluster wherein the
primary node includes a primary repository manager, means for choosing a secondary node in
each cluster wherein the secondary node includes a secondary repository manager, and means for
causing the secondary repository manager to cooperate with the primary repository manager to
maintain information at the secondary node consistent with information maintained at the
primary node.

Thus, in accordance with an aspect of the invention, a cluster configuration repository is a
software component of a carrier-grade high availability platform. The repository provides the
capability of storing and retrieving configuration data in real-time. The repository is a highly
available service and it is distributed on a cluster. It also supports redundant persistent storage
devices, such as disks or flash RAM. The repository further provides applications with a simple
application programming interface (API). The primitives are essentially elementary
record-oriented data management functions: creation, destruction, update and retrieval.

It is to be understood that both the foregoing general description and the following 
detailed description are exemplary and explanatory and are intended to provide further 
explanation of the invention as claimed. 

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are included to provide a further understanding of
the invention and are incorporated in and constitute a part of this specification, illustrate
embodiments of the invention and together with the description serve to explain the principles of
the invention. In the drawings:

FIG. 1 is a diagram illustrating a clustered high availability network.
FIG. 2 is a diagram illustrating a single cluster with n nodes, including a primary and
secondary node.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 
Reference will now be made in detail to the preferred embodiments of the present 
invention, examples of which are illustrated in the accompanying drawings. 

The present invention, referred to in this embodiment as the cluster configuration
repository, is a software component of a carrier-grade high availability platform. A main
purpose is to provide real-time retrieval of configuration data from anywhere within the cluster.
The cluster configuration repository is a fast, lightweight, and highly available persistent
database that is distributed on the cluster and allows data in various forms, such as structures
and tables, to be stored and retrieved. Using a carrier-grade high availability event service, the
cluster configuration repository can also notify applications whenever repository data is modified.
In addition, it can support redundant persistent storage devices, such as disks or flash RAM.

The cluster configuration repository also provides applications within the cluster a simple
API. The primitives are essentially elementary record-oriented data management functions such
as creation, destruction, update and retrieval. In order to satisfy the performance requirements
for some time-critical cluster configuration repository services, the cluster configuration
repository offers two types of APIs: a common base API, and a real-time API. The common
base API set includes a set of primitives that are not performance-critical. The real-time API, on
the other hand, guarantees high performance for read operations of repository data.

The cluster configuration repository must be highly available within the carrier-grade
high availability platform. To support such a requirement, the cluster configuration repository
managers should be available in a primary/secondary mode to eliminate the possibility of a
single point of failure. This primary/secondary configuration allows a secondary instance of the
repository to always be available to replace the master repository, should it ever fail.

A key role of the cluster configuration repository in the carrier-grade high availability
platform is that many external management and configuration operations can be initiated merely
by updating the information kept in the cluster configuration repository. An application can
register its interest in specific information kept in the cluster configuration repository and will
then be automatically notified whenever any changes in that data occur. Following the
notification, the application can take appropriate actions.
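
By way of illustration only, the registration and notification behavior described above may be sketched in Python as follows. The class name NotifyingRepository and the methods register_interest and update are hypothetical names chosen for this sketch; the specification does not name such functions, and a real embodiment would deliver notifications through the platform event service rather than through in-process callbacks.

```python
from collections import defaultdict


class NotifyingRepository:
    """Toy in-process repository that notifies registered applications on change."""

    def __init__(self):
        self._data = {}                       # (table, key) -> record
        self._watchers = defaultdict(list)    # table -> list of callbacks

    def register_interest(self, table, callback):
        # Hypothetical registration call (assumption, not from the specification).
        self._watchers[table].append(callback)

    def update(self, table, key, record):
        # Store the record, then notify every application watching this table.
        self._data[(table, key)] = dict(record)
        for callback in self._watchers[table]:
            callback(table, key, dict(record))


seen = []
repo = NotifyingRepository()
repo.register_interest("usertable", lambda t, k, r: seen.append((t, k, r)))
repo.update("usertable", "news", {"frequency": 101})
repo.update("othertable", "x", {"v": 1})   # nobody registered: no callback runs
```

In this sketch an external management operation amounts to nothing more than a call to update; the watching application decides what action to take when its callback runs.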

Aside from storing configuration data, the cluster configuration repository can also be
used by the HA-aware applications as a highly available, distributed, persistent storage facility
for slow-changing application/device state information (such as calibration data, software
version information, health history, and administrative states).

Referring to FIG. 1, a highly available network 10 is divided into clusters 20, 30 and 40.
Each of the clusters 20, 30, and 40 is an organization of nodes 22. Nodes 22 are organized within
each cluster to provide highly available software applications and access to hardware.

Referring to FIG. 2, a cluster in the present invention is normally made up of at least a
primary node 50 and a secondary node 60. Core cluster services are provided as primary
services 56. A back-up copy is provided as secondary services 66. Within these copies of core
cluster services, the primary services 56 include the primary repository manager 52 and the
secondary services 66 include the secondary repository manager 62. The primary services 56,
including the primary repository manager 52, are generally located on the primary node 50. The
primary repository manager 52 is responsible for: managing the persistent storage of the
repository data on disk; maintaining an in-memory copy of the entire repository to guarantee
high performance for read operations; and synchronizing the repository updates.

The secondary repository manager 62, on the other hand, is generally located on the
secondary node 60 and keeps both an in-memory copy of the repository data 64 and a disk copy
of the repository data, each synchronized with those maintained by the primary manager. This
implies that the secondary manager maintains its own persistent data store. The two repository
managers 52 and 62 cooperate to (1) provide highly available repository services, and (2) make
sure that when the primary manager fails, the secondary manager will have consistent and
up-to-date repository information to continue offering the cluster configuration repository
services to its clients.

Repository managers 52 and 62 run on two nodes 50 and 60 (with access to local disks)
of the cluster. Each of the remaining nodes 70 runs a repository agent 72 that interfaces with the
primary repository manager 52 to serve its local clients. Therefore, the cluster configuration
repository clients, other than clients on nodes 50 and 60, never interact directly with the
repository managers 52 and 62. They always contact the local repository agent 72 to get the
cluster configuration repository services. Each repository agent 72 handles an in-memory
software cache of priority repository data and can handle read requests by itself. However, to
ensure proper serialization among concurrent updates, all repository data updates are managed
by the primary repository manager 52 only. The repository agents 72 therefore forward all
write/update requests to the primary repository manager 52.
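
The division of labor between a repository agent and the primary repository manager may be sketched as follows. This is a minimal Python sketch under stated assumptions: the class and method names are hypothetical, the cache is filled on first read, and the propagation of updates to the caches of other agents is deliberately not modeled.

```python
class PrimaryManager:
    """Stands in for the primary repository manager: the only writer."""

    def __init__(self):
        self.store = {}
        self.write_log = []            # order in which updates were serialized

    def write(self, key, value):
        self.write_log.append(key)
        self.store[key] = value

    def read(self, key):
        return self.store[key]


class RepositoryAgent:
    """Per-node agent: serves reads from a local cache, forwards every write."""

    def __init__(self, primary):
        self.primary = primary
        self.cache = {}

    def read(self, key):
        if key not in self.cache:              # cache miss: fetch once from the primary
            self.cache[key] = self.primary.read(key)
        return self.cache[key]

    def write(self, key, value):
        self.primary.write(key, value)         # the primary serializes all updates
        self.cache[key] = value                # simplification: refresh own cache only


primary = PrimaryManager()
agent_a = RepositoryAgent(primary)
agent_b = RepositoryAgent(primary)
agent_a.write("calibration", 42)
value_seen_by_b = agent_b.read("calibration")
```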

An important requirement of the cluster configuration repository service is to guarantee
the consistency of the information kept by the two repository managers 52 and 62. This
requirement remains even in the presence of undesirable events such as a failure of a repository
manager 52 or 62, as well as failures in the data repositories 54 or 64. The cluster configuration
repository design satisfies this requirement by enforcing "all or nothing" write semantics. The
client sends the data to be written/updated to the primary manager 52 only. The primary
manager 52 works with its secondary manager 62 counterpart and validates the successful
completion of a write operation only when both primary manager 52 and secondary manager 62
have completed the operation successfully. In case of failure of one manager, the other manager
rolls back the effect of the operation and returns its repository to the state prior to the initiation
of the write operation.
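
The "all or nothing" write semantics described above may be illustrated by the following Python sketch. It is not the protocol of the preferred embodiment: the names Manager, replicated_write and fail_next are hypothetical, both managers live in one process, and a failure is simulated by a test hook rather than by a real node or disk fault.

```python
_MISSING = object()   # marks "key did not exist before the write"


class ManagerFailure(Exception):
    pass


class Manager:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.fail_next = False        # test hook simulating a failing manager

    def apply(self, key, value):
        if self.fail_next:
            self.fail_next = False
            raise ManagerFailure(self.name)
        previous = self.store.get(key, _MISSING)
        self.store[key] = value
        return previous               # remembered so the write can be undone

    def rollback(self, key, previous):
        if previous is _MISSING:
            self.store.pop(key, None)
        else:
            self.store[key] = previous


def replicated_write(primary, secondary, key, value):
    """Return True only if both managers applied the write; otherwise undo it."""
    try:
        previous = primary.apply(key, value)
    except ManagerFailure:
        return False                          # nothing was applied anywhere
    try:
        secondary.apply(key, value)
    except ManagerFailure:
        primary.rollback(key, previous)       # restore the pre-write state
        return False
    return True


p, s = Manager("primary"), Manager("secondary")
first_ok = replicated_write(p, s, "a", 1)
s.fail_next = True
second_ok = replicated_write(p, s, "a", 2)    # secondary fails, primary rolls back
```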

The primary repository manager 52 and secondary repository manager 62 support the 
cluster configuration repository services in a highly available manner. There can be several 
ways of assigning these repository managers to the cluster nodes. The following approach is a 
preferred embodiment. 

The carrier-grade high availability platform has various primary services 56, including the
cluster configuration repository, that must be available in primary/secondary form 56 and
66, e.g., the Component Instance/Role Manager (CRIM). It is desirable that the primary
instances of these services 56 are co-located in the same node. It is also desirable that the
secondary instances 66 are co-located on a node as well. The best possible location for the
primary instances of these services 56 is the master or primary node 50 of the cluster. The
carrier-grade high availability platform includes a cluster membership monitor to monitor
removal and joining of nodes into clusters (due to failure, repair completion, or addition of a new
node). The cluster membership monitor elects two nodes with special responsibilities: (1)
Primary (Master) node 50, and (2) Secondary (Vice Master) node 60. It is preferred to assign the
master node 50 to run the primary instances for all system services.

The secondary node 60 (which is an already-elected backup for the primary node 50) is
also a preferred location to run the secondary instances of these services 66. When the primary
instance of any of these services fails, this failure will be interpreted to mean that the primary
node 50 is incapable of hosting carrier-grade high availability system services, meaning that all
primary instances of system services 56 should be failed over to the secondary node 60. In other
words, after a failure in any of the system services 56, the cluster membership monitor will be
notified to switch over the master role to the secondary node 60. Then, the cluster membership
monitor will elect a new secondary master node and secondary instances of the system services
will be recreated in the newly elected secondary master node.

WO 01/84338 PCT/US01/14228
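
The switch-over sequence described above may be sketched as a small function. This is an illustrative Python sketch only: the function name fail_over is hypothetical, and the rule used here to pick the new secondary (the lowest remaining node name) is an assumption, since the election criteria of the cluster membership monitor are not given.

```python
def fail_over(nodes, master, vice_master):
    """Return (new_master, new_vice_master) after a primary service failure.

    The failed master is dropped, the vice master takes over the master role,
    and a new vice master is elected among the remaining nodes.
    """
    survivors = [n for n in nodes if n != master]
    new_master = vice_master
    candidates = [n for n in survivors if n != new_master]
    # Election rule below is an assumption made for this sketch.
    new_vice = min(candidates) if candidates else None
    return new_master, new_vice


three_node_result = fail_over(["n1", "n2", "n3"], master="n1", vice_master="n2")
two_node_result = fail_over(["n1", "n2"], master="n1", vice_master="n2")
```

With only two nodes the sketch returns no new secondary, which corresponds to a cluster that keeps running without a backup until another node joins.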

The primary repository manager 52 runs in the master node 50 of the cluster, and the
secondary repository manager 62 runs in the secondary node 60. In other words, the failure of
the primary repository manager is translated to the failure of the master node and will be handled
in that context. However, the cluster configuration repository should include mechanisms for
handling various failures of its components.

When a cluster configuration repository service is started (for example during cluster
initialization), it will start with an empty repository. The repository can then be populated
through OAM&P (Operation, Administration, Maintenance and Provisioning). However, a
second embodiment provides that some minimal repository information is included in the boot
image where it can be used as the initial repository. The initial repository can, for example,
include the information about the configuration of other essential carrier-grade high availability
system services. The rest of the repository can be built later with the help of the clients
themselves or OAM&P.

There are two possible upgrade styles during a software upgrade process: (i) a rolling
upgrade, and (ii) a split-mode upgrade. During a rolling upgrade the services are upgraded
incrementally (one node at a time); thus, no specific protocol is needed to keep the cluster
configuration repository service available to the whole cluster. However, during the split-mode
upgrade the cluster is divided into two semi-clusters: one running the new release (new domain),
and the other running the previous release (old domain). It is then inevitable to have two disjoint
cluster configuration repository services, one for each domain. The cluster configuration
repository supporting the new domain will initialize its repository using the same process as the
cluster configuration repository initialization discussed earlier. This newly created cluster
configuration repository will be initialized using the repository information in the boot image or
through OAM&P. It is important to notice that there are no automatic repository data exchanges
between the two cluster configuration repositories. The new cluster configuration repository
populates its data directly from the client or OAM&P, but not from the cluster configuration
repository of the old domain. After the completion of the upgrade process, the cluster
configuration repository representing the old domain dies out.

In a preferred embodiment, clients view the repository data as a set of tables. A table is
represented as a regular file on a Unix-like file system. Each record (i.e., a row in the table) is
accessed through a primary key. A hashing technique is used to map the given key into the
location of the corresponding record. A table is represented in memory as a set of chunks. A
chunk is a set of contiguous bytes and can be dynamically allocated/de-allocated to a table on an
as-needed basis.

If a table is opened in a node with the cached option, it will be cached in the address
space of the local repository agent when the table is accessed for the first time. To further
enhance the performance of read operations, the cluster configuration repository maps the cached
table to the corresponding application address spaces using a POSIX-like shared memory facility.

The cluster configuration repository is organized as a set of data tables, which can be
accessed in a consistent manner from any node in the cluster. At creation time, it can be
requested that a table be persistent. The table is then kept on redundant persistent storage
devices. Tables are created with a given initial size that determines the number of pre-allocated
records. This policy has been chosen to ensure that the minimal set of vital resources can be
pre-allocated at creation time. Tables may grow dynamically after creation if the necessary
resources (i.e., memory and storage space) are still available.

The name space of the tables is the global name space also used by event channels and 
checkpoints. Tables are referred to by their context and name, which are managed by the 
Naming Service through the naming API. The name server entry of a table created as persistent 
is also persistent. 

Each table of the cluster configuration repository contains records of the same
composition. A record is composed of a set of columns. Each column is represented by a unique
name, which is a string. The value of a column may be of the following types: signed and
unsigned number types (8, 16, 32 and 64 bit), string (fixed size array of ASCII characters), and
fixed-length raw data. A string is null-terminated, therefore its length (the number of characters
before the null character) is variable and may be less than the size of the array, which is fixed
and corresponds to the resources allocated for the string.

By construction, records in a given table all have the same fixed size. The API design
assumes that the record size is between 4 bytes and 4 Kbytes, but does not exclude larger sizes.
As records in a given table have the same composition, this composition is also called the record
format or the table schema.
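
A fixed-size record with a null-terminated string column, a number column and a raw data column may be illustrated with the Python struct module. The three-column layout below (an 8-byte string, a 16-bit unsigned number, 4 raw bytes) is a hypothetical example chosen for this sketch, as are the function names; the little-endian, unpadded byte layout is likewise an assumption and not a statement about the on-disk format of the repository.

```python
import struct

# Hypothetical record format: 8-byte string array, uint16, 4 bytes of raw data.
RECORD = struct.Struct("<8sH4s")


def pack_record(name, count, raw):
    encoded = name.encode("ascii")
    if len(encoded) >= 8:
        # The array is fixed at 8 bytes and must leave room for the null.
        raise ValueError("string needs room for its terminating null")
    return RECORD.pack(encoded, count, raw)   # short 's' fields are null-padded


def unpack_record(data):
    name, count, raw = RECORD.unpack(data)
    # The string length is whatever precedes the first null character.
    return name.split(b"\0", 1)[0].decode("ascii"), count, raw


blob = pack_record("abc", 7, b"\x01\x02\x03\x04")
```

Every packed record occupies RECORD.size bytes regardless of how long the string value is, which is what allows a record to be located by simple arithmetic once its slot is known.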

The cluster configuration repository identifies records using keys. One specific column
of the record format is the record key. This particular column must be of type string, and its
value is the unique identifier of a record within the table. The only way to search for a given
record is by specifying its key. Each table is created with an associated hash index used to
perform these lookups. The number of hash buckets (size of the index) can be specified at the
time the table is created.
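
A keyed lookup through a hash index with a caller-chosen number of buckets may be sketched as follows. This is a Python sketch under stated assumptions: the class name HashIndexedTable is hypothetical, CRC-32 stands in for whatever hash function an embodiment would use, and collisions are resolved by chaining within a bucket.

```python
import zlib


class HashIndexedTable:
    def __init__(self, n_buckets):
        # The number of buckets is fixed when the table is created.
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # CRC-32 is an arbitrary stand-in hash (assumption).
        return self.buckets[zlib.crc32(key.encode("ascii")) % len(self.buckets)]

    def write(self, key, record):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)      # existing key: update in place
                return
        bucket.append((key, record))           # unknown key: creates the record

    def read(self, key):
        for existing_key, record in self._bucket(key):
            if existing_key == key:
                return record
        raise KeyError(key)


table = HashIndexedTable(n_buckets=4)
table.write("news", {"frequency": 101})
table.write("sport", {"frequency": 202})
table.write("news", {"frequency": 303})        # same key: overwrites, no new record
```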

The repository allows an application to obtain a private copy of a record. The API
supports the retrieval of any number of columns of a given record. This helps optimize the
access cost by avoiding the transfer of an entire (potentially large) record.

A record is created by writing a record with a key which does not exist in the table yet.
Two local or concurrent write operations of the same record are serialized at some point (no
interleaving occurs). When a record write operation successfully returns, the record has been
committed to the redundant persistent storage. Subsequent reads of that record on any node
return the updated data. If the write fails, the cluster configuration repository guarantees that the
record has not been committed. If a read operation is issued concurrently with a write, it returns
either the old values (before the modification) or the new values (after the modification), but not
a mix of old and new values.

The repository also supports updates of any number of columns of a record. This means
the whole record doesn't have to be rewritten just to update one column. In general, a change to
the repository incurs a notification to the applications in the form of an event.

A bulk update is an operation in which a large number of modifications are done to the
repository. In order to optimize the cost of this operation, the process issues a bulk update start
request, makes the individual modifications, then issues a bulk update end operation.

After the start primitive, the modifications are done using the usual primitives of the API.
However, their effects may not be propagated throughout the cluster upon return from these
primitives. This means that read operations on some other nodes may return the values as they
were before the modifications. To simplify the management of concurrent modifications by
other applications, only one process in the cluster can engage a bulk update at a time, and other
processes will get an error code if they issue a bulk update start request.

When a process starts a bulk update, it specifies whether updates from other processes are
still possible. If they are, individual writes can be interleaved within a bulk update without
compromising the atomicity of any writes and reads. The only difference is that individual
updates are immediately propagated throughout the cluster.

The end primitive completes the bulk update operations previously started by the same
process. It returns when all the modifications issued subsequent to the start point are propagated
throughout the cluster. It also allows a new bulk update to be issued. If a process terminates for
any reason (e.g., exit or crash) and it was in the middle of a bulk update operation, an implicit
bulk update end operation is performed. The update operations already performed remain valid
(no rollback).

In contrast to the non-bulk update operations, modifications made within a bulk update
do not generate events; only one notification event is sent after the bulk update end is issued. If
the bulk update end was made implicitly by the cluster configuration repository (i.e., the
application process crashes), a special event is sent to tell applications that the bulk update is
over and that it did not finish as planned.
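
The event behavior of a bulk update may be sketched as follows. This Python sketch makes several assumptions: the names BulkRepo, bulk_start, bulk_end and the event labels are hypothetical, writes from other processes are taken to be permitted during the bulk update, and the deferred propagation of modifications across the cluster is not modeled at all.

```python
class BulkUpdateBusy(Exception):
    pass


class BulkRepo:
    def __init__(self):
        self.data = {}
        self.events = []           # events that would be sent to applications
        self.bulk_owner = None     # process currently engaged in a bulk update

    def bulk_start(self, pid):
        if self.bulk_owner is not None:
            raise BulkUpdateBusy(self.bulk_owner)   # only one bulk update at a time
        self.bulk_owner = pid

    def write(self, pid, key, value):
        self.data[key] = value
        if pid != self.bulk_owner:
            self.events.append(("changed", key))    # ordinary write: event at once

    def bulk_end(self, pid, implicit=False):
        if pid != self.bulk_owner:
            raise ValueError("no bulk update in progress for this process")
        self.bulk_owner = None
        # A single event closes the bulk update; a special one if it was implicit.
        self.events.append(("bulk_aborted" if implicit else "bulk_done", pid))


repo = BulkRepo()
repo.bulk_start(pid=1)
repo.write(1, "a", 1)
repo.write(1, "b", 2)
try:
    repo.bulk_start(pid=2)
    second_start_rejected = False
except BulkUpdateBusy:
    second_start_rejected = True
repo.write(2, "c", 3)              # interleaved write from another process
repo.bulk_end(1)
```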

Applications with critical time constraints require read access within a few hundred
nanoseconds. A real-time API gives such applications faster mechanisms to
retrieve data. It introduces new objects such as handles, column ids and links. It does not
provide a real-time write operation.

An application accessing a table for the first time can request that the table be cached. If
real-time access is required, the data needs to be present in memory on the local node, therefore
the table must be cached. Using the real-time API on a non-cached table returns an error.

Caching a table has an impact on the memory consumption of both the local node and the
main server. On the local node, the cache is populated on a per-request basis. Therefore, records
that haven't been read once are not present on the local node and need to be fetched from the
main server the first time they are accessed. On the main server, if the table is opened as cached,
the full table is loaded into memory when the open call is performed. It is unloaded from
memory when the table is closed. In other words, if the table is not cached, there is no
in-memory representation of the table on the cluster and all operations must be performed on the
persistent storage. It can be seen as a trade-off between performance and memory consumption.

If a table is opened by multiple processes, but only one wants it cached, caching has
priority. In such a case, the table is loaded in memory of the main server and cached on the node
where that application is running. Memory for the cache and on the main server is freed when
the last application requesting caching closes the table.

Handles can be used by the application to memorize the result of a record lookup.
Applications aware of the real-time API can then use handles to retrieve or update data once the
cost of the initial lookup has been paid.

As a key is a string, columns of string type may contain keys to express persistent
relations between records. Such columns are called links. Cross-table relations can be expressed
using links, with the assumption that the related table names are known by the application and
explicitly passed to the cluster configuration repository API.

The cluster configuration repository basic API uses the keys to express persistent
references to other records. The real-time API of the cluster configuration repository internally
associates a link to each one of these keys. The initial state of a link is "unresolved"; a lookup
operation is required to resolve the link by using its associated key. Once resolved, links allow
the process to access data without performing a lookup, just as handles do. As opposed to
handles, links are internal cluster configuration repository entities that cannot be accessed or
copied into the process address space.

Accessing the repository through the real-time API is a two-step process: 1) look up the
repository using a key value to obtain a handle; and 2) use the handle to access the designated
record.

The provided real-time API functions have the same semantics as the equivalent basic
versions. They may return an ESTALE error condition when used with a handle corresponding
to a deleted record. A real-time retrieve operation may return EWOULDBLOCK if the data to
be read is not in memory on the local node yet.
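
The two-step access pattern and the two error conditions may be illustrated by the following Python sketch. The class RealTimeTable and its methods lookup, fetch, rt_read and delete are hypothetical names; only the error codes ESTALE and EWOULDBLOCK come from the description above, and they are taken here from the standard Python errno module.

```python
import errno


class RealTimeTable:
    def __init__(self, master_rows):
        self._master = dict(master_rows)   # stands in for the main server copy
        self._local = {}                   # records already in memory on this node
        self._handles = {}                 # handle -> key
        self._next = 0

    def lookup(self, key):
        """Step 1: pay the lookup cost once and obtain a handle."""
        if key not in self._master:
            raise KeyError(key)
        self._next += 1
        self._handles[self._next] = key
        return self._next

    def fetch(self, handle):
        """Blocking path (assumption): bring the record into local memory."""
        key = self._handles[handle]
        if key in self._master:
            self._local[key] = self._master[key]

    def rt_read(self, handle):
        """Step 2: non-blocking read through the handle."""
        key = self._handles[handle]
        if key not in self._master:
            raise OSError(errno.ESTALE, "record was deleted")
        if key not in self._local:
            raise OSError(errno.EWOULDBLOCK, "record not in local memory yet")
        return self._local[key]

    def delete(self, key):
        self._master.pop(key, None)
        self._local.pop(key, None)


table = RealTimeTable({"news": {"frequency": 101}})
handle = table.lookup("news")
try:
    table.rt_read(handle)
    first_errno = None
except OSError as exc:
    first_errno = exc.errno
table.fetch(handle)
value = table.rt_read(handle)
table.delete("news")
try:
    table.rt_read(handle)
    last_errno = None
except OSError as exc:
    last_errno = exc.errno
```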

In the basic API, the cluster configuration repository recognizes elementary, string and
raw data types. The columns composing a record are considered as occupying a row in the table.
Rows all have the same composition and are described by the data schema for a given table.
The following example illustrates what a schema definition looks like:

<cluster configuration repositoryTBL name="usertable" key="channel">
    <COL name="channel"   title="Channel Name"      type="char"     size="12" />
    <COL name="frequency" title="Channel Frequency" type="int32_t" />
    <COL name="category"  title="Category Name"     type="char"     size="10" />
    <COL name="flags"     title="Attributes"        type="uint16_t" />
    <COL name="encrypt"   title="Encryption key"    type="uint8_t"  size="12" />
    <COL name="sector"    title="Sector"            type="char"     size="14" />
</cluster configuration repositoryTBL>


The declaration key="channel" indicates that the column named channel is the key of the
record. The attribute title is optional and can be used to add a text description for the column.
The attribute type is one of the supported data types, as described above. If the field is an array,
its size is specified by the optional attribute size (default value is 1).
The above XML definition corresponds to the following table structure:

channel      frequency    category     flags         encrypt      sector
(12 char)    (int32_t)    (10 char)    (uint16_t)    (uint8_t)    (14 char)

A schema is provided as ASCII text. A parser reads the text and decodes the 
composition. 
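
A parser of the kind just described may be sketched in Python with the standard xml.etree.ElementTree module. Several points are assumptions of this sketch: the root tag is written CCRTBL because a tag name containing spaces cannot be parsed as XML, the function names are hypothetical, and the byte sizes assigned to the type names follow the usual meaning of the C fixed-width types.

```python
import xml.etree.ElementTree as ET

# Example schema; the root tag name CCRTBL is an assumption of this sketch.
SCHEMA = """
<CCRTBL name="usertable" key="channel">
  <COL name="channel"   title="Channel Name"      type="char"     size="12" />
  <COL name="frequency" title="Channel Frequency" type="int32_t" />
  <COL name="category"  title="Category Name"     type="char"     size="10" />
  <COL name="flags"     title="Attributes"        type="uint16_t" />
  <COL name="encrypt"   title="Encryption key"    type="uint8_t"  size="12" />
  <COL name="sector"    title="Sector"            type="char"     size="14" />
</CCRTBL>
"""

# Bytes per element for each type name (assumed from the usual C meanings).
TYPE_SIZES = {
    "char": 1,
    "int8_t": 1, "uint8_t": 1,
    "int16_t": 2, "uint16_t": 2,
    "int32_t": 4, "uint32_t": 4,
    "int64_t": 8, "uint64_t": 8,
}


def parse_schema(text):
    """Return (table_name, key_column, [(name, type, element_count), ...])."""
    root = ET.fromstring(text.strip())
    columns = []
    for col in root.findall("COL"):
        count = int(col.get("size", "1"))      # the size attribute defaults to 1
        columns.append((col.get("name"), col.get("type"), count))
    key = root.get("key")
    if key not in [name for name, _, _ in columns]:
        raise ValueError("key column %r is not defined" % key)
    return root.get("name"), key, columns


def record_size(columns):
    return sum(TYPE_SIZES[type_name] * count for _, type_name, count in columns)


table_name, key_column, columns = parse_schema(SCHEMA)
size_in_bytes = record_size(columns)
```

For the example schema the computed record size is 12 + 4 + 10 + 2 + 12 + 14 = 54 bytes, ignoring any alignment padding an embodiment might add.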

Within the cluster configuration repository, data is organized in tables. The table
identifier type used by the API is ccr_table_t. One entry of a given table is a record. All the
records included in a table share the same type and size specified when the table is created.

Tables are referred to by their context and name within the global name space.
An empty table can be created by using the ccr_table_create() primitive. The ctx parameter
specifies the context where the table is to be created and table_name is the name of the table in
that context. The client must have write permission for the context ctx. The schema parameter
points to a buffer containing the schema text. A further parameter specifies the number of
pre-allocated records in the table and the number of hash buckets used to index the table. If the
operation is successful, the table is created and desc is its identifier. The ccr_table_create()
call is blocking.



The ccr_table_unlink() primitive deletes the table table_name in the context ctx. The
client must have write permission for the context ctx. This operation will effectively remove the
table data when the table is no longer open by any processes.

The ccr_table_open() primitive gives access to the table table_name in the context ctx. If
the operation is successful, the table identifier is returned in desc. This identifier's scope is the
process calling this primitive. This call is blocking.

The ccr_table_close() primitive removes the access to the table specified by desc. In
other words, after this operation, subsequent operations using desc or its associated handles
return an error.
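
The create, open, unlink and close semantics described above can be sketched with a toy, single-process model in Python. This is an illustrative sketch, not the repository implementation: the class Repository and its method names are hypothetical, permissions and blocking behavior are omitted, and it is assumed that the descriptor returned at creation counts as an open reference.

```python
class Repository:
    """Toy model of table create/open/close/unlink (names are hypothetical)."""

    def __init__(self):
        self._tables = {}     # (ctx, table_name) -> table state
        self._open = {}       # desc -> table state
        self._next_desc = 1

    def _attach(self, table):
        # Hand out a new descriptor and count it as an open reference.
        desc = self._next_desc
        self._next_desc += 1
        table["opens"] += 1
        self._open[desc] = table
        return desc

    def table_create(self, ctx, table_name, schema_text):
        if (ctx, table_name) in self._tables:
            raise FileExistsError(table_name)
        table = {"schema": schema_text, "rows": {}, "opens": 0, "unlinked": False}
        self._tables[(ctx, table_name)] = table
        return self._attach(table)

    def table_open(self, ctx, table_name):
        return self._attach(self._tables[(ctx, table_name)])  # KeyError if absent

    def table_close(self, desc):
        table = self._open.pop(desc)  # later use of desc raises KeyError
        table["opens"] -= 1
        self._reclaim(table)

    def table_unlink(self, ctx, table_name):
        table = self._tables.pop((ctx, table_name))  # name disappears at once
        table["unlinked"] = True
        self._reclaim(table)

    def _reclaim(self, table):
        # Data is removed only once the table is unlinked and no longer open.
        if table["unlinked"] and table["opens"] == 0:
            table["rows"].clear()
            table["schema"] = None

    def rows(self, desc):
        return self._open[desc]["rows"]


repo = Repository()
d1 = repo.table_create("ctx", "usertable", "<schema text>")
repo.rows(d1)["news"] = {"frequency": 101}
repo.table_unlink("ctx", "usertable")
still_there = "news" in repo.rows(d1)  # the table is still open through d1
repo.table_close(d1)                   # last close: the data is now removed
```

In this model the name is removed immediately by the unlink operation, while the data survives until the last descriptor is closed, mirroring the deferred removal described for ccr_table_unlink().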

The ccr_stat() primitive fills in the stat structure with information about the table
specified by desc. The fields uid and gid are the credentials of the creator of the table, and mode
is the protection mode specified during creation. The flags field is the current flag status of the
table; part of it is inherited from creation (O_PERSISTENT) and part is dynamic (O_CACHED).
The rows field is the number of records in the table. If the table is persistent and stored on disk,
disk_size is the number of bytes occupied by the image of the table on the file system. If there is
no image of the table on a file system, disk_size is set to 0. The schema_size field is the size in
bytes of the XML text describing the schema of the table.

The ccr_get_XML() primitive returns in the buffer xml_buffer of size buffer_size the
ASCII text describing the schema of the table specified by desc, as passed during the
ccr_table_create() call. The buffer xml_buffer must be large enough to receive the full text. The
ccr_stat() call can return the size of the XML schema description.

Records may be retrieved from the repository by using their key. Columns of a given
record can be retrieved by specifying the key of the record and the names of the columns, in any
order. The operation is non-blocking.

The ccr_record_kget() primitive finds in the table specified by desc the record whose key
value matches the key parameter and, if found, copies into the locations pointed to by the
column_values array the values of the columns specified by the column_names array. The
column names in this array must be column names defined by the table schema, in any order.
This primitive is blocking.

A "put" operation takes a number of columns of a single record and commits them to a
given cluster configuration repository table. Atomicity is ensured on a per-record basis. First, a
lookup is performed to find out if another record with the same key already exists. If such a
record exists, it is overwritten. If it does not exist and specific arguments are given, a new record
(new row) is created, and this may result in a memory (and storage space) allocation operation.
In the new record, the columns not specified in the put operation are initialized to default values:
integer types have a default value of 0, raw data is filled with 0, and strings have 0 as first
character (empty strings). A "put" operation is blocking and returns only when the write is
committed to the repository. From the return of the call onward, read operations are guaranteed
to return the updated values.

The ccr_record_kput() primitive commits new column values of a record to the cluster
configuration repository. Atomicity is guaranteed on a per-record basis. The desc parameter
specifies the table of data previously opened. By default, ccr_record_kput() is used to update
existing records, but it can also be used to create a new record by passing a new key and setting
the bit CCR_PUT_EXCREAT of the parameter put_flags.
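
The key-based get and put behavior described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual primitives: the functions record_kput and record_kget, the dictionary-based table, and the numeric value of CCR_PUT_EXCREAT are hypothetical, and integer arrays such as encrypt are assumed to be the "raw data" that is zero-filled by default.

```python
CCR_PUT_EXCREAT = 0x1  # flag name from the text; the numeric value is assumed

# Column name -> (type, array size), following the example schema.
COLUMNS = {
    "channel": ("char", 12),
    "frequency": ("int32_t", 1),
    "category": ("char", 10),
    "flags": ("uint16_t", 1),
    "encrypt": ("uint8_t", 12),
    "sector": ("char", 14),
}
KEY_COLUMN = "channel"


def default_value(col_type, size):
    """Default for a column that a creating put does not specify."""
    if col_type == "char":
        return ""           # empty string
    if size > 1:
        return bytes(size)  # raw data filled with zeros (assumed mapping)
    return 0                # integer types default to 0


def record_kput(table, key, names, values, put_flags=0):
    """Commit the named columns of one record; create it only if flagged."""
    unknown = [n for n in names if n not in COLUMNS]
    if unknown:
        raise KeyError(unknown)
    existing = table.get(key)
    if existing is None:
        if not put_flags & CCR_PUT_EXCREAT:
            raise LookupError(key)
        record = {n: default_value(*COLUMNS[n]) for n in COLUMNS}
        record[KEY_COLUMN] = key
    else:
        record = dict(existing)  # work on a copy of the current record
    record.update(zip(names, values))
    table[key] = record          # one assignment: the record changes as a whole


def record_kget(table, key, names):
    """Return the requested columns of the record, in the order requested."""
    record = table[key]
    return [record[n] for n in names]


table = {}
record_kput(table, "news", ["frequency"], [101], CCR_PUT_EXCREAT)
record_kput(table, "news", ["flags"], [3])
print(record_kget(table, "news", ["flags", "frequency", "category"]))
```

Building the new record on a copy and installing it with a single assignment is one simple way to model per-record atomicity: a reader sees either the old record or the new one, never a mixture.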

Record destruction is performed by calling the ccr_record_delete() primitive, which
removes the data record identified by key from the cluster configuration repository table. When
ccr_record_delete() returns, the record has been removed. Handles associated with the record
become obsolete. A call to the ccr_record_delete() primitive is blocking.

The cluster configuration repository publishes events on event channels upon
modifications to tables of the repository. Using the event API, an application can subscribe to an
event channel to be notified of table changes. There is at most one event channel where the
cluster configuration repository publishes notifications for a given table. An application can ask
the cluster configuration repository what the channel for a particular table is, provided it has read
permission on the table. An application can set the event channel used for the notifications on a
table (it associates an event channel with a table). It needs to have read and write permissions on
the table to do so.

Event channels are managed by the applications (creation, deletion, etc.); therefore
access permissions to the channel are up to the application which creates it. As event channels
are global to the cluster, if an application sets the event channel for a table, other applications on
other nodes can see it and subscribe to it (if they have the proper permissions). The same event
channel can be used for the notifications of several tables.

The cluster configuration repository exports a default notification channel. This
well-known channel avoids an unnecessary channel declaration when notifications on a given
table are not subject to any visibility restriction.

When an application removes the association between a table and a channel, an event of 
type CCR_NOTIFICATION_END is published to notify all the subscribers. It is up to the 
subscribers to stop listening, set a new channel for the table, or ask the cluster configuration
repository if a new association has been made. 

The ccr_channel_get() primitive returns in the buffer channel the full name of the event
channel where the cluster configuration repository publishes notifications of table changes for
the table called table_name in the context ctx. If there is no current association, an error is
returned. Processes can then subscribe to the channel to start receiving notifications. The caller
must have read permissions on the table. The maximum size of the channel name is the
maximum size of a compound name as defined in the naming API. The call is blocking.

Upon return from the ccr_channel_set() primitive, the cluster configuration repository
publishes on the channel events related to the table table_name in the context ctx. The specified
event channel must have been created before and the caller must have read and write permissions
on the table. If an event channel is already associated with the table and channel is not
CCR_NO_CHANNEL, an error is returned.
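
The association rules for event channels can be sketched with a small Python model. This is an illustrative sketch only: the class Notifier, the "changed" event tuple, and the representation of CCR_NO_CHANNEL as None are assumptions, and the sketch further assumes that passing CCR_NO_CHANNEL is how an existing association is removed, which the text implies but does not state outright. Permission checks are omitted.

```python
CCR_NO_CHANNEL = None  # assumed representation of the constant
CCR_NOTIFICATION_END = "CCR_NOTIFICATION_END"


class Notifier:
    """Toy model of the table-to-channel association (names hypothetical)."""

    def __init__(self, existing_channels):
        # Channel name -> list of events published on that channel.
        self.channels = {name: [] for name in existing_channels}
        self.assoc = {}  # (ctx, table_name) -> channel name

    def channel_set(self, ctx, table_name, channel):
        key = (ctx, table_name)
        current = self.assoc.get(key)
        if channel is CCR_NO_CHANNEL:
            # Assumed meaning: remove the association and tell subscribers.
            if current is not None:
                del self.assoc[key]
                self.channels[current].append((CCR_NOTIFICATION_END, table_name))
            return
        if current is not None:
            raise RuntimeError("a channel is already associated with the table")
        if channel not in self.channels:
            raise LookupError("the event channel must be created first")
        self.assoc[key] = channel

    def channel_get(self, ctx, table_name):
        if (ctx, table_name) not in self.assoc:
            raise LookupError("no current association")
        return self.assoc[(ctx, table_name)]

    def publish_change(self, ctx, table_name, record_key):
        # At most one channel carries the notifications of a given table.
        channel = self.assoc.get((ctx, table_name))
        if channel is not None:
            self.channels[channel].append(("changed", table_name, record_key))


notifier = Notifier(["ops/alerts"])
notifier.channel_set("ctx", "usertable", "ops/alerts")
notifier.publish_change("ctx", "usertable", "news")
notifier.channel_set("ctx", "usertable", CCR_NO_CHANNEL)
print(notifier.channels["ops/alerts"])
```

Because several tables may share one channel, each event in the sketch carries the table name so that a subscriber can tell which table a notification concerns.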

Failure of the event subsystem may prevent a notification of record change from being
delivered to a subscribing application. In such cases, the application will eventually receive a
notification that an event about a change to table X may have been lost. It is then up to the
application to check whether the records of table X in which it is interested have changed.

The real-time API should only be used on nodes where the accessed tables are cached.
Data retrieval may be performed in two phases: 1) a lookup phase and 2) an actual read phase.
During the lookup phase, the application specifies the table descriptor and the key of the record
it wants to access to obtain a handle on this record. A handle is therefore specific to a table
descriptor. The application also obtains a column identifier from the column name. A handle and
a column ID define a "cell" in the table. During the second phase, the application uses the RT
API to obtain the content of the cell.

A column ID cannot become stale (unless the table is deleted and re-created), whereas a
handle can become stale (when the record is deleted). For a handle to be valid, the table needs to
be open. When the table is closed, all handles on that table become immediately stale.

The ccr_handle_get() primitive performs a lookup in the table specified by desc and
returns a handle to the record specified by key. This call is blocking.

The ccr_handle_status() primitive checks the status of hdl. This call is non-blocking.

The ccr_cid_get() primitive provides column identifiers from column names in the table
specified by desc.

The ccr_record_hget() primitive copies from the record specified by hdl the value of the
columns specified by cid to the location specified by the column_value pointer array.

The ccr_record_hput() primitive writes to the columns specified by the cid array of the
record specified by hdl the values at the locations pointed to by column_value. Though it uses
handles and column identifiers, the ccr_record_hput() primitive is blocking and does not provide
a real-time write operation.
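
The two-phase access pattern and the staleness rules for handles can be sketched as follows. This Python model is an assumption-laden illustration, not the real-time API itself: the class Table and its methods are hypothetical, a handle is modeled as a slot number that is never reused, and a column ID is modeled as the column's position in the schema.

```python
class Table:
    """Toy model of handles and column IDs (all names are hypothetical)."""

    def __init__(self, column_names):
        self.cids = {name: i for i, name in enumerate(column_names)}
        self.slots = {}     # slot number -> list of column values
        self.by_key = {}    # record key -> slot number
        self.next_slot = 0  # slot numbers are never reused
        self.is_open = True

    def insert(self, key, values):
        slot = self.next_slot
        self.next_slot += 1
        self.slots[slot] = list(values)
        self.by_key[key] = slot

    def delete(self, key):
        del self.slots[self.by_key.pop(key)]

    # Lookup phase: comparatively slow, performed once.
    def handle_get(self, key):
        return self.by_key[key]

    def cid_get(self, names):
        return [self.cids[n] for n in names]

    def handle_status(self, hdl):
        # A handle is valid only while the table is open and the record exists.
        return self.is_open and hdl in self.slots

    # Read phase: direct access to a cell, no key lookup.
    def record_hget(self, hdl, cids):
        if not self.handle_status(hdl):
            raise LookupError("stale handle")
        row = self.slots[hdl]
        return [row[c] for c in cids]


t = Table(["channel", "frequency", "flags"])
t.insert("news", ["news", 101, 3])
hdl = t.handle_get("news")        # lookup phase
cids = t.cid_get(["frequency"])
value = t.record_hget(hdl, cids)  # read phase
t.delete("news")
stale = not t.handle_status(hdl)  # the handle is stale once the record is gone
```

Never reusing slot numbers keeps a stale handle from silently pointing at a different record; the column IDs remain valid for the lifetime of the table because the schema does not change.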

Links are used to express references between records of possibly different tables. Links
are a cluster configuration repository internal optimization which allows the repository to
memorize the result of a lookup on one node.

The ccr_link_resolve() primitive performs a lookup to find in the table identified by
destTable the record whose key value is the one in the record specified by srcHdl at the column
specified by srcCid. The result of the lookup is stored in the administrative data of the cluster
configuration repository and it will be used to avoid further lookups if ccr_link_resolve() is
called again from any process on the same node. This call is blocking.
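
The memorization performed by link resolution can be sketched in Python as a per-node cache. This is only a sketch under stated assumptions: the class LinkCache, the dictionary layout of the tables, and the "programs" table are hypothetical, and invalidation of cached results when records are deleted is not modeled.

```python
class LinkCache:
    """Toy per-node cache of resolved links (names are hypothetical)."""

    def __init__(self):
        # (source table, source handle, source cid, destination table) -> handle
        self.cache = {}
        self.lookups = 0  # number of real lookups performed

    def link_resolve(self, src_table, src_hdl, src_cid, dest_table):
        memo_key = (src_table["name"], src_hdl, src_cid, dest_table["name"])
        if memo_key not in self.cache:
            self.lookups += 1
            # The cell (src_hdl, src_cid) holds a key of the destination table.
            dest_key = src_table["rows"][src_hdl][src_cid]
            self.cache[memo_key] = dest_table["index"][dest_key]
        return self.cache[memo_key]


# A hypothetical "programs" table whose column 1 refers to a channel by key.
programs = {
    "name": "programs",
    "rows": [["evening show", "news"]],
    "index": {"evening show": 0},
}
channels = {
    "name": "usertable",
    "rows": [["news", 101]],
    "index": {"news": 0},
}

links = LinkCache()
first = links.link_resolve(programs, 0, 1, channels)
second = links.link_resolve(programs, 0, 1, channels)  # served from the cache
frequency = channels["rows"][second][1]
```

Only the first call pays for the key lookup in the destination table; the second call with the same arguments returns the memorized handle.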

The ccr_bulkupdate_start() primitive indicates that the process will subsequently issue
several modifications to the repository. The caller may prevent other processes from making any
updates by setting the writer parameter accordingly. The modifications are done using the usual
primitives as described above. However, their effects may not be propagated immediately
throughout the cluster upon return from these primitives. This means that read operations on
some nodes may return the values as they were before the modifications. During a bulk update,
notifications are not sent by the update operations.

To simplify the management of concurrent modifications by other processes, only one 
process in the cluster can engage a bulk update at a time, and other processes will get an error 
code when calling this primitive. 

The ccr_bulkupdate_end() primitive completes the bulk update operation previously
started by the same process by calling ccr_bulkupdate_start(). It returns when all the
modifications issued after the ccr_bulkupdate_start() are effective in the cluster, and a bulk
update event is sent. It also allows a new bulk update to be issued. If a process terminates for
any reason (exit or crash) and it was in the middle of a bulk update operation, an implicit bulk
update end operation is performed.
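
The bulk update rules (one bulk update at a time, optional exclusion of other writers, no per-update notifications, a single event at the end, and an implicit end on process termination) can be sketched with the following Python model. The class BulkUpdateCoordinator, its method names, and the event tuples are hypothetical, and the delayed propagation of changes across nodes is not modeled. The sketch also assumes that notifications are suppressed for every update made while a bulk update is in progress.

```python
class BulkUpdateCoordinator:
    """Toy model of the bulk update rules (names are hypothetical)."""

    def __init__(self):
        self.owner = None       # process currently engaged in a bulk update
        self.exclusive = False  # True if other writers are locked out
        self.events = []        # events published so far

    def bulkupdate_start(self, pid, writer_exclusive=False):
        if self.owner is not None:
            raise RuntimeError("a bulk update is already in progress")
        self.owner = pid
        self.exclusive = writer_exclusive

    def put(self, pid, table, key, value):
        if self.exclusive and pid != self.owner:
            raise PermissionError("updates are blocked during the bulk update")
        table[key] = value
        if self.owner is None:
            self.events.append(("changed", key))  # normal notification
        # While a bulk update is in progress, no per-update event is sent.

    def bulkupdate_end(self, pid):
        if pid != self.owner:
            raise RuntimeError("bulk update was not started by this process")
        self.owner = None
        self.exclusive = False
        self.events.append(("bulk_update",))  # one event for the whole batch

    def process_exit(self, pid):
        # Implicit end if the process exits or crashes mid-update.
        if pid == self.owner:
            self.bulkupdate_end(pid)


coordinator = BulkUpdateCoordinator()
table = {}
coordinator.bulkupdate_start(pid=1, writer_exclusive=True)
coordinator.put(1, table, "news", 101)
coordinator.put(1, table, "sport", 102)
coordinator.process_exit(1)  # ends the bulk update implicitly
print(coordinator.events)
```

Subscribers in this model receive a single bulk update event instead of one event per modified record, and a new bulk update may be started as soon as the previous one has ended.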

The browsing API allows exploration of a table. Starting from the beginning of the table,
it returns an array of keys of existing records. Then, assuming the table schema is known, the
record content can be read using the get primitives. Successive calls to a browsing primitive
start at the location of the table where the previous call finished.

The ccr_table_list() primitive initializes the browser structure for browsing the table
specified by desc. After initialization, browsing starts at the beginning of the table.

The ccr_browse_next() primitive copies keys of existing records to the buffer specified
by buffer, from the table and starting from a position implicitly defined by browser. It writes at
most count keys in the buffer, or stops if the end of the table is reached. Keys are strings,
therefore their length is variable, but all keys take the same space in the buffer, which is the
maximum size for a key defined in the table schema. The return value is the actual number of
keys written to the buffer. The browser structure is updated to the new browsing state.
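
The browsing behavior, including the fixed-size key slots in the output buffer, can be sketched in Python as follows. This is a minimal sketch, not the browsing API: the functions table_list and browse_next take simplified, hypothetical parameters, the browser is assumed to work on a snapshot of the keys taken at initialization, and unused bytes of a slot are assumed to be zero-filled.

```python
def table_list(table_keys):
    """Initialize a browser positioned at the beginning of the table."""
    return {"keys": list(table_keys), "pos": 0}


def browse_next(browser, buffer, count, key_size):
    """Copy up to count keys into buffer and return the number written.

    Every key occupies exactly key_size bytes in the buffer, padded with
    zero bytes, whatever its actual length.
    """
    count = min(count, len(buffer) // key_size)  # never overrun the buffer
    start = browser["pos"]
    batch = browser["keys"][start:start + count]
    for i, key in enumerate(batch):
        raw = key.encode("ascii")
        if len(raw) > key_size:
            raise ValueError("key longer than the schema allows")
        buffer[i * key_size:(i + 1) * key_size] = raw.ljust(key_size, b"\0")
    browser["pos"] = start + len(batch)  # the next call resumes from here
    return len(batch)


def read_keys(buffer, n, key_size):
    """Decode the first n fixed-size slots of the buffer back into strings."""
    return [
        bytes(buffer[i * key_size:(i + 1) * key_size]).rstrip(b"\0").decode("ascii")
        for i in range(n)
    ]


KEY_SIZE = 12  # the channel key is declared as 12 chars in the example schema
browser = table_list(["news", "sport", "weather"])
buf = bytearray(2 * KEY_SIZE)

n1 = browse_next(browser, buf, 2, KEY_SIZE)
page1 = read_keys(buf, n1, KEY_SIZE)
n2 = browse_next(browser, buf, 2, KEY_SIZE)
page2 = read_keys(buf, n2, KEY_SIZE)
n3 = browse_next(browser, buf, 2, KEY_SIZE)  # end of table reached
```

Fixed-size slots let the caller size the buffer as count times the maximum key size and index directly into it, at the cost of some unused bytes for short keys.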

The ccr_debug utility is a command-line tool to analyze a table representation in memory
and on disk (if applicable). It can detect and correct anomalies in the content of a table.
ccr_debug interacts with the cluster configuration repository to execute the command on the
record designated by key of the table table_name.

It will be apparent to those skilled in the art that various modifications and variations can be
made in the system for providing real-time cluster configuration data within a clustered computer
network of the present invention without departing from the spirit or scope of the invention.
Thus, it is intended that the present invention cover the modifications and variations of this
invention provided they come within the scope of the appended claims and their equivalents.
CLAIMS

What is claimed is: 

1. A system for providing real-time cluster configuration data within a clustered computer
network comprising a plurality of clusters, comprising:

a primary node in each cluster wherein said primary node includes a primary repository 
manager; 

a secondary node in each cluster wherein said secondary node includes a secondary 
repository manager; and 

wherein said secondary repository manager cooperates with said primary repository 
manager to maintain information at said secondary node consistent with information maintained 
at said primary node. 

2. The system of claim 1, wherein said primary node further comprises a primary data 
repository and primary services. 

3. The system of claim 2, wherein said secondary node further comprises a secondary data 
repository and secondary services. 

4. The system of claim 1, further comprising: 

at least one additional node in at least one cluster wherein said additional node includes a 
repository agent. 

5. The system of claim 4, wherein said repository agent forwards all write/update requests 
to said primary repository manager. 

6. The system of claim 4, wherein said repository agent includes a software cache of
repository data, wherein said repository data may be quickly accessed by an application.

7. The system of claim 1, wherein said primary repository manager manages the storage of
repository data on a first computer-readable medium, the maintenance of repository data on
memory, and the synchronization of repository updates.

8. The system of claim 7 wherein said secondary repository manager manages the storage 
of repository data on a second computer-readable medium, and the maintenance of repository 
data on memory. 

9. The system of claim 8 wherein the repository data in said secondary node is
synchronously updated so as to remain consistent with the repository data of said first node.

10. The system of claim 8 wherein said first and second computer-readable mediums each
include a disc.

11. A method of providing real-time cluster configuration data within a clustered computer
network comprising a plurality of clusters, comprising the steps of:

choosing a primary node in each cluster wherein said primary node includes a primary 
repository manager; 

choosing a secondary node in each cluster wherein said secondary node includes a 
secondary repository manager; and 

causing said secondary repository manager to cooperate with said primary repository 
manager to maintain information at said secondary node consistent with information maintained 
at said primary node. 

12. The method of claim 11, comprising the further step of:

providing a repository agent for each additional node of each cluster, wherein the
repository agent interfaces with the primary repository manager in its cluster.

13. The method of claim 11, comprising the further steps of:

sending write/update information from a client only to said primary repository manager;

causing said write/update information to be written in said primary repository manager 
and said secondary repository manager; and 

validating completion of the entry of said write/update information only when the 
information successfully is written in both said primary repository manager and said secondary 
repository manager. 

14. A computer program product comprising a computer useable medium having computer
readable code embodied therein for providing real-time cluster configuration data within a
clustered computer network comprising a plurality of clusters, the computer program product
adapted when run on a computer to effect steps including:

choosing a primary node in each cluster wherein said primary node includes a primary 
repository manager; 

choosing a secondary node in each cluster wherein said secondary node includes a 
secondary repository manager; and 

causing said secondary repository manager to cooperate with said primary repository
manager to maintain information at said secondary node consistent with information maintained 
at said primary node. 

15. The computer program product of claim 14, wherein the computer program product is
adapted when run on a computer to effect the further steps of:

providing a repository agent for each additional node of each cluster, wherein the 
repository agent interfaces with the primary repository manager in its cluster. 

16. The computer program product of claim 14, comprising the further steps of:

sending write/update information from a client only to said primary repository manager;

causing said write/update information to be written in said primary repository manager
and said secondary repository manager; and

validating completion of the entry of said write/update information only when the 
information successfully is written in both said primary repository manager and said secondary 
repository manager. 

17. A computer program product comprising a computer useable medium having computer 
readable code embodied therein for providing real-time cluster configuration data within a 
clustered computer network comprising a plurality of clusters, the computer program product
comprising: 

means for choosing a primary node in each cluster wherein said primary node includes a 
primary repository manager; 

means for choosing a secondary node in each cluster wherein said secondary node 
includes a secondary repository manager; and 

means for causing said secondary repository manager to cooperate with said primary 
repository manager to maintain information at said secondary node consistent with information 
maintained at said primary node. 

18. The computer program product of claim 17, further comprising: 

means for providing a repository agent for each additional node of each cluster, wherein
the repository agent interfaces with the primary repository manager in its cluster. 

19. The computer program product of claim 17, further comprising:

means for sending write/update information from a client only to said primary repository
manager; 

means for causing said write/update information to be written in said primary repository 
manager and said secondary repository manager; and 

means for validating completion of the entry of said write/update information only when 
the information successfully is written in both said primary repository manager and said 
secondary repository manager. 